# A Large-scale Longitudinal Multimodal Dataset of State-backed Information Operations on Twitter

## Dataset Articles

This dataset was collected and processed as described in the paper "A Large-scale Longitudinal Multimodal Dataset of State-backed Information Operations on Twitter".

## Dataset File Names
The data files in this dataset are named with the following pattern: `<data type>` `<subdataset>` `<class>`.json, where `<data type>` is either `tweets` or `users` and specifies whether the file contains tweet items or user items. The `<subdataset>` consists of `<country>` and `<time>`: `<country>` identifies the country or countries the subdataset is affiliated with, and `<time>` corresponds to the time Twitter released the data. The `<class>` is either `positive` or `negative`, corresponding to `state-sponsored` data and `background` data, respectively.
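
As a rough illustration of the naming pattern above, the sketch below splits a file name into its three parts. The separator between the parts is not stated here, so the regular expression tolerates any run of non-alphanumeric characters; the helper `parse_file_name` and the example file name are hypothetical and are not taken from the dataset.

```python
import re

# Assumption: the three parts appear in the documented order and may be
# separated by any non-alphanumeric characters (underscore, space, dash, ...).
_NAME_RE = re.compile(r"^(tweets|users)[\W_]*(.+?)[\W_]*(positive|negative)\.json$")


def parse_file_name(name):
    """Split a data file name into data type, subdataset, and class."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    data_type, subdataset, cls = match.groups()
    return {
        "data_type": data_type,
        "subdataset": subdataset,
        "class": cls,
        # `positive` means state-sponsored, `negative` means background.
        "label": "state-sponsored" if cls == "positive" else "background",
    }


# Hypothetical file name, used only to exercise the parser.
info = parse_file_name("tweets_iran_201901_positive.json")
```

If the real files use a different layout, only the regular expression needs to change.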

## Data Structure
The data in our dataset is in JSON format. Each row in the files contains information about one tweet or one user. The structure of our data is as follows:
### Tweet Item
1. tweet id, the id of the tweet.
2. user id, the id of the user who published the tweet. For tweets in the positive dataset, the id is hashed. For tweets in the negative dataset, the id is the original Twitter user id.
3. subdataset, the subdataset this item belongs to, which should be the same as the `<subdataset>` in the file name.
4. class, the class of the tweet, where `positive` means `state-sponsored` and `negative` means `background`.
5. name, the screen name of the user who published the tweet. For tweets in the positive dataset, this value is hashed by Twitter.
6. tweet time, the date and time when the tweet was published, in the format `<year>`-`<month>`-`<day>` `<hour>`:`<minute>`.
7. account lang, the language chosen by the user.
8. tweet lang, the language of the tweet.
9. \# of likes, the number of likes of the tweet.
10. \# of retweets, the number of retweets of the tweet.
11. hashtags, the list of hashtags in the tweet.
12. urls, the list of URLs in the tweet.
13. mentions, the list of ids of the users mentioned in the tweet.
14. images, the list of images in the tweet. For tweets in the negative dataset, this is the list of URLs of images, and for tweets in the positive dataset, this is the list of file names provided by Twitter.
15. image hashes, the list of hash values of the images in the tweet, used for identifying matching images.
16. length, the length of the tweet in characters.
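
The following is a minimal sketch of reading tweet items, assuming each row is a standalone JSON object and that the JSON keys match the field names listed above (for example `tweet time`). The actual key names in the files may differ, and the sample row is made up for illustration.

```python
import io
import json
from datetime import datetime


def iter_items(lines):
    """Yield one dict per non-empty row, assuming one JSON object per row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def parse_tweet_time(value):
    """Parse the documented `<year>-<month>-<day> <hour>:<minute>` format."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M")


# Hypothetical row; the key names are assumptions, not confirmed by the files.
sample = io.StringIO(
    '{"tweet time": "2017-03-05 14:07", "hashtags": ["a", "b"], "class": "positive"}\n'
)
rows = list(iter_items(sample))
first_time = parse_tweet_time(rows[0]["tweet time"])
```

For a real file, pass an open file handle to `iter_items` in place of the in-memory sample.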
### User Item
1. id, the id of the user.
2. screen name, the screen name of the user. For users in the positive dataset, this is hashed.
3. subdataset, the subdataset this item belongs to, which should be the same as the `<subdataset>` in the file name.
4. class, the class of the data, where `positive` means `state-sponsored` and `negative` means `background`.
5. location, the list of self-reported locations of the user.
6. creation date, the date the account was created, in the format `year-month-day`.
7. lang, the list of languages selected by the user.
8. \# min followers, the minimum number of followers of the user during our collection time period.
9. \# max followers, the maximum number of followers of the user during our collection time period.
10. \# min friends, the minimum number of friends of the user during our collection time period.
11. \# max friends, the maximum number of friends of the user during our collection time period.
12. profile, the list of profile descriptions of the user.
13. age (days), the age of the account in days, calculated backwards from December 31, 2018.
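
The age field can be reproduced approximately from the creation date. The sketch below assumes the creation date is a `year-month-day` string and that age is the plain day difference to December 31, 2018; whether the dataset counts the end day inclusively is not stated, so results may differ by one. The function name `account_age_days` is hypothetical.

```python
from datetime import date, datetime

# Reference date stated in the field description.
REFERENCE_DATE = date(2018, 12, 31)


def account_age_days(creation_date, reference=REFERENCE_DATE):
    """Return the number of days from the creation date to the reference date."""
    created = datetime.strptime(creation_date, "%Y-%m-%d").date()
    return (reference - created).days


# Made-up creation date, used only to illustrate the calculation.
example_age = account_age_days("2018-01-01")
```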

